WGS Upscaling - IT & Bioinformatics Evaluation

Data transfer, Data storage, Bioinformatics pipeline capacity
Author
Affiliation

GDx

Published

December 7, 2023

1 Background

GDx at OUSAMG is planning to upscale the WGS production to 4 x 48 samples or 2 x 48 + 1 x 96 samples per week.

This document evaluates possible bottlenecks in the IT and bioinformatics pipelines in the following areas:

  1. Data transfer speed
  2. Data storage
  3. Pipeline capacity (Illumina DRAGEN)

2 IT & Bioinformatics

2.1 Data transfer speed

2.1.1 Collect Historical Data Transfer Records

To evaluate the data transfer speed, we collected the transfer time of all files that were transferred from NSC to TSD between 2023-09-01 08:41:40 and 2023-11-28 13:26:06 from the nsc-exporter. The nsc-exporter is the tool that is used to transfer data from NSC to TSD.

Five example records:

datetime  2023-11-23 08:23:11
project   wgs334
filename  Diag-wgs334-HG64374875-DR.alignedMTshifted.bam
bytes     36033304
seconds   0.8
speed     40520000

datetime  2023-11-09 15:44:32
project   wgs329
filename  HG70239620-Kortvoksthet-KIT-wgs_S47_R2_001.fastq.gz
bytes     41378477732
seconds   474.7
speed     83120000

datetime  2023-11-13 14:35:56
project   wgs330
filename  manta_Diag-wgs330-HG54343081-PK_std.vcf
bytes     6528689
seconds   0.2
speed     38360000

datetime  2023-09-30 17:52:55
project   wgs315
filename  Diag-wgs315-HG62403783C8942.sample
bytes     1564
seconds   0
speed     60600

datetime  2023-10-23 19:52:17
project   wgs321
filename  231016_A00943_0761_BHJ7C3DSX7.HG26010946-Hjernekanal-KIT-wgs_S16_R1_001.qc.pdf
bytes     118178
seconds   0.1
speed     1633530

The nsc-exporter log files and the sequencer overview HTML files were ignored for simplicity (see footnote 1).
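
As a quick consistency check on these records, the effective throughput can be recomputed from the bytes and seconds fields. The following is a minimal sketch; the function name `throughput` is hypothetical, and how nsc-exporter derives its logged speed field is not confirmed here.

```python
def throughput(bytes_transferred: int, seconds: float) -> dict:
    """Return effective throughput in MB/s (10^6 bytes) and MiB/s (2^20 bytes)."""
    if seconds <= 0:
        # Durations are logged with one decimal, so tiny files can show 0 s.
        return {"MB/s": None, "MiB/s": None}
    rate = bytes_transferred / seconds  # bytes per second
    return {"MB/s": rate / 1e6, "MiB/s": rate / 2**20}

# The FASTQ record shown above (project wgs329).
fastq = throughput(41_378_477_732, 474.7)
print(f"{fastq['MB/s']:.1f} MB/s, {fastq['MiB/s']:.1f} MiB/s")
# prints "87.2 MB/s, 83.1 MiB/s"
```

For this record the logged speed value of 83120000 is closer to the MiB/s figure than to the MB/s figure. This suggests, but does not prove, that the speed field is derived from a binary-unit rate, which is worth confirming before comparing speeds against link capacity.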

2.1.2 Data Overview

The file size of the collected data ranges from 0.0 B to 100.9 GiB. The average file size is 1.5 GiB. The median file size is 9.3 KiB. The standard deviation is 8.1 GiB.

         filesize
Min.        0.0 B
1st Qu.   428.0 B
Median    9.3 KiB
Mean      1.5 GiB
3rd Qu. 968.0 KiB
Max.    100.9 GiB

The transfer speed ranges from 1.0 B/s to 93.1 MiB/s. The average transfer speed is 12.2 MiB/s. The median transfer speed is 288.1 KiB/s. The standard deviation is 23.4 MiB/s.

        speed(/s)
Min.        1.0 B
1st Qu.  12.0 KiB
Median  288.1 KiB
Mean     12.2 MiB
3rd Qu.   8.4 MiB
Max.     93.1 MiB

The transfer time ranges from 0 seconds to 2084.4 seconds. The average transfer time is 19.6 seconds. The median transfer time is 0 seconds. The standard deviation is 104.3 seconds.

    seconds       
 Min.   :   0.00  
 1st Qu.:   0.00  
 Median :   0.00  
 Mean   :  19.56  
 3rd Qu.:   0.10  
 Max.   :2084.40  

2.1.3 Correlation Between File Size And Transfer Time And Transfer Speed

2.1.3.1 Transfer speed and time VS file size (all files)

Small files have lower transfer speed. A good transfer speed around 80 MB/s is achieved for files larger than 30 GB.

Figure 1: Transfer speed VS file size (all files)
Figure 2: Transfer time VS file size (all files)

2.1.3.2 Transfer speed and time VS file size (small files)

Although the transfer speed of small files is very low, their transfer time is usually very short. Small files are therefore not the bottleneck of the data transfer.

Figure 3: Transfer speed VS file size (small files)
Figure 4: Transfer time VS file size (small files)

2.1.3.3 Maximum transfer speed reached around 200 MB file size?

Small files have lower transfer speed and large files have higher transfer speed. However, the best transfer speed appears to be reached for files of around 200 MB.

Figure 5: Maximum transfer speed reached around 200 MB file size

2.1.4 Idle Time

To evaluate whether there is capacity for upscaling, we need to know the idle time of the nsc-exporter. The nsc-exporter is idle when it is not transferring data.

All transfer records are plotted with the starting time of each transfer on the x-axis and the time used to finish the transfer on the y-axis. The gaps represent idle periods of the nsc-exporter. The color represents the project, e.g. wgs123, EKG20230901. The shape represents the project type, e.g. wgs, EKG. You can turn off a project by clicking it in the legend to the right of the figure.

For easier visualization, the data is grouped in months.
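
The idle periods visible as gaps in the plots can also be extracted numerically. The sketch below is one possible approach, assuming that the logged datetime is the start of a transfer and that a gap counts as idle only above a chosen minimum length. The function name `idle_gaps`, the one-minute default, and the demo records are all hypothetical.

```python
from datetime import datetime, timedelta

def idle_gaps(transfers, min_gap=timedelta(minutes=1)):
    """Return (gap_start, gap_end) pairs where no transfer was running.

    `transfers` is a list of (start, duration_seconds) tuples.
    """
    gaps = []
    busy_until = None
    for start, seconds in sorted(transfers):
        if busy_until is not None and start - busy_until >= min_gap:
            gaps.append((busy_until, start))
        end = start + timedelta(seconds=seconds)
        # Keep the latest end time in case transfers overlap.
        busy_until = end if busy_until is None else max(busy_until, end)
    return gaps

# Made-up records for illustration, not taken from the nsc-exporter logs.
demo = [
    (datetime(2023, 9, 1, 8, 0, 0), 600.0),
    (datetime(2023, 9, 1, 8, 10, 30), 30.0),
    (datetime(2023, 9, 1, 12, 0, 0), 5.0),
]
for gap_start, gap_end in idle_gaps(demo):
    print(gap_start, "->", gap_end, "idle for", gap_end - gap_start)
```

Summing the gap lengths per week would give a direct estimate of the spare transfer time, which is the quantity that matters for the upscaling question.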

2.1.4.1 September

Figure 6: Idle time in September
Figure 7: Idle time in September (logarithmic time)

2.1.4.2 October

Figure 8: Idle time in October
Figure 9: Idle time in October (logarithmic time)

2.1.4.3 November

Figure 10: Idle time in November
Figure 11: Idle time in November (logarithmic time)

2.1.5 Discussion

2.1.5.1 Do we transfer too many small files?

Figure 12: Total number of small files vs large files
Figure 13: Total time used for transferring small files

threshold  small files  large files
<100 kB           1892      2516443
<1 MB             4026      2514309
<10 MB            6528      2511807
<100 MB          10266      2508069
<1 GB            36881      2481455
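
A split of this kind can be reproduced from the transfer records with a few lines of code. The sketch below is illustrative only: `split_by_threshold` is a hypothetical helper, and the input is limited to the five example records from section 2.1.1 rather than the full data set. It reports both the number of files and the total transfer time per group, since both are relevant to the question in this section.

```python
def split_by_threshold(records, threshold_bytes):
    """Summarise files below and at/above a size threshold.

    `records` is an iterable of (size_bytes, seconds) tuples.
    Returns {"small": (count, total_seconds), "large": (count, total_seconds)}.
    """
    summary = {"small": [0, 0.0], "large": [0, 0.0]}
    for size_bytes, seconds in records:
        key = "small" if size_bytes < threshold_bytes else "large"
        summary[key][0] += 1
        summary[key][1] += seconds
    return {k: tuple(v) for k, v in summary.items()}

# Sizes and durations of the five example records from section 2.1.1.
examples = [
    (36_033_304, 0.8),
    (41_378_477_732, 474.7),
    (6_528_689, 0.2),
    (1_564, 0.0),
    (118_178, 0.1),
]
for label, limit in [("<100 kB", 100e3), ("<1 MB", 1e6), ("<10 MB", 1e7)]:
    print(label, split_by_threshold(examples, limit))
```

Even in this tiny sample, the single FASTQ file accounts for almost all of the transfer time, while the small files contribute well under a second.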

2.1.5.2 Possibility Of One More 48-sample Run Per Week

  • The nsc-exporter is idle for a substantial portion of the time.
    • Quite long idle time was observed in September (Figure 6).
    • Almost 12 wgs projects were transferred in November.
  • The maximum transfer speed is reached at a file size of around 200 MB (Figure 5). This matches the configured chunk size of s3cmd, the tool used by nsc-exporter for data transfer. Increasing the chunk size might improve the transfer speed.
  • The current transfer speed is not optimal considering the 10 Gbps switch connecting NSC and TSD. We need to investigate the reason for the low transfer speed.
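
To put the last two points in numbers, the sketch below compares the observed large-file speed with the nominal line rate and counts how many multipart chunks a large file is split into. It is a rough estimate under stated assumptions: protocol overhead is ignored, the chunk size is interpreted in MiB, and the 1024 MiB comparison value is arbitrary. The function names are hypothetical.

```python
import math

LINK_GBPS = 10          # switch speed quoted above
OBSERVED_MB_S = 80      # approximate large-file speed from section 2.1.3.1

def link_utilization(observed_mb_s: float, link_gbps: float) -> float:
    """Fraction of the nominal line rate used, ignoring protocol overhead."""
    line_rate_mb_s = link_gbps * 1000 / 8   # Gbit/s -> MB/s (decimal units)
    return observed_mb_s / line_rate_mb_s

def multipart_parts(file_bytes: int, chunk_mib: int) -> int:
    """Number of multipart chunks for a file, assuming chunk size in MiB."""
    return math.ceil(file_bytes / (chunk_mib * 2**20))

print(f"utilization: {link_utilization(OBSERVED_MB_S, LINK_GBPS):.1%}")
print("parts at 200 MiB:", multipart_parts(41_378_477_732, 200))
print("parts at 1024 MiB:", multipart_parts(41_378_477_732, 1024))
# prints 6.4%, 198 and 39
```

At roughly 6% of the nominal line rate, the switch itself is unlikely to be the limiting factor, so the per-chunk overhead, the endpoints, or the storage on either side are the more likely places to look.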

2.1.6 Conclusion

  • We might be able to run 4 x 48 or 2 x 48 + 1 x 96 samples per week with the current transfer speed. However, this would bring us to the maximum data transfer capacity.
  • If we can increase the transfer speed, e.g. to 200 MB/s, we can easily double the current production capacity.
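
The effect of transfer speed on weekly capacity can be estimated with simple arithmetic. The sketch below is not based on measured volumes: the per-sample data volume `GB_PER_SAMPLE` is a hypothetical placeholder that should be replaced with the real figure, and the transfer is assumed to run continuously at the stated speed.

```python
HOURS_PER_WEEK = 7 * 24

def weekly_transfer_hours(samples: int, gb_per_sample: float, speed_mb_s: float) -> float:
    """Hours of continuous transfer needed for one week's samples."""
    total_mb = samples * gb_per_sample * 1000   # GB -> MB (decimal units)
    return total_mb / speed_mb_s / 3600

GB_PER_SAMPLE = 150   # hypothetical placeholder, not a measured value

for speed in (80, 200):
    hours = weekly_transfer_hours(4 * 48, GB_PER_SAMPLE, speed)
    print(f"{speed} MB/s: {hours:.0f} h of {HOURS_PER_WEEK} h per week")
```

With the placeholder volume, 192 samples would need about 100 hours of transfer per week at 80 MB/s and about 40 hours at 200 MB/s. The real answer depends entirely on the actual per-sample volume.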

2.2 Data storage

WGS produces a large amount of data, so data storage capacity is critical for the upscaling.

2.2.1 NSC

On the NSC side, the data is stored on boston at /boston/diag. Boston has a total capacity of 1.5 PB, and the usable capacity is 1.2 PB at the moment.

2.2.2 TSD

On the TSD side, the data is stored in /cluster/projects/p22. The total capacity is 1.8 PB, and the usable capacity is 1.2 PB at the moment.

2.3 Pipeline capacity (Illumina DRAGEN)

Illumina DRAGEN is a bioinformatics pipeline server that can be used to process WGS data. It takes around 1 hour to process a 30x WGS sample.
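
The processing time per sample translates directly into a weekly load estimate. The sketch below assumes samples are processed back to back with no downtime, reruns, or queueing. The number of DRAGEN servers is not stated in this document, so `servers` is a hypothetical parameter.

```python
HOURS_PER_WEEK = 7 * 24
HOURS_PER_SAMPLE = 1.0   # approximate figure quoted above for a 30x sample

def weekly_load(samples_per_week: int, servers: int = 1) -> float:
    """Fraction of available server time needed, assuming back-to-back runs."""
    needed = samples_per_week * HOURS_PER_SAMPLE
    return needed / (servers * HOURS_PER_WEEK)

for servers in (1, 2):
    print(f"{servers} server(s): {weekly_load(4 * 48, servers):.0%} of weekly capacity")
# prints 114% for one server and 57% for two
```

Under these assumptions, 192 samples per week need 192 server-hours, more than the 168 hours in a week. A single server would not keep up unless the per-sample time is noticeably below one hour.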

3 Discussion

To be added…

4 Conclusion

To be added…

Footnotes

  1. The nsc-exporter log and sequencer overview HTML files are very small and do not belong to any project. They are always transferred in a very short time and do not affect the transfer speed of other files. Therefore, they are ignored for simplicity.